#### EXPERIMENT #3

## Introduction to SystemVerilog, FPGA, CAD, and 16-bit Adders

#### I. OBJECTIVE

In this experiment you will transition from breadboard TTL (transistor-transistor logic) elements to RTL (register-transfer level) design on an FPGA using SystemVerilog. You will come to understand the basic syntax and constructs of SystemVerilog, as well as acquire the basic skill required to operate Quartus Prime, a CAD tool for FPGA synthesis and simulation. Quartus Prime's performance analysis and optimization tools will be explored in the process of implementing three types of adders: a carry-ripple adder, a carry-lookahead adder, and a carry-select adder. This performance analysis and optimization will look at the various adders' area, power, and maximum operating frequencies.

## II. INTRODUCTION

If you have not done so already, please read the <u>INTRODUCTION TO SYSTEMVERILOG AND TUTORIAL</u> and the <u>INTRODUCTION TO QUARTUS PRIME</u>

In addition to the standard synthesis and simulation capability, Quartus Prime provides a variety of compiler settings for the designer to tweak for the synthesis and compilation process. Depending on the settings the designer can gear the generated circuit to comply with some predefined constraints or performance criteria, such as the maximum operating frequency of the circuit, the maximum area of the circuit layout, or the maximum static or dynamic power consumed by the circuit.

During the synthesis and compilation process, Quartus Prime collects a variety of analysis data and display them in the generated Compilation Report. These data are important to the designer in the sense that the designer relies on these data to determine if his or her circuit has met the performance constraints. If the analysis result is far off from the performance criteria, the designer will most likely have to modify the circuit from the designing aspect of the circuit. On the other hand, if the analysis result is just slightly below the performance criteria, then the designer can use many of the built-in tools to optimize the circuit during the compilation process to meet the performance criteria.

Quartus Prime offers a variety of optimization tools, such as TimeQuest Timing Analyzer for the timing constraint, PowerPlay Power Analyzer for the power constraint, and a built-in placement fitter for the area constraint. Many of the optimization steps can be done by simply changing the various synthesis and compilation settings, as suggested by the Quartus Prime Optimization Advisors, some of the in-depth optimization and analysis can only be done by providing specific constraints to the analyzers.

In most industry practices, circuit implementation on FPGA is usually only a small portion of the entire design, where the circuit on FPGA will interface with external circuits through its inputs and outputs. These external circuits will have their own performance constraints which the FPGA circuit has to follow to be integrated. To incorporate these external constraints into the FPGA design, they are written into constraint files such as the Synopsys Design Constraint (SDC) format as input to the Quartus Prime Analyzers, where the analyzers will then be able to analyze and optimize the circuit based on the provided constraints.

To read more about the optimization process in Quartus Prime, please refer to Section III in Volume 2 of the Quartus Prime Standard Edition Handbook (currently v18.1, accessible on the Intel FPGA website) Chapter 10 gives a design optimization overview, Chapter 12-14 discuss timing, power, and area optimization, respectively.

For this lab, we will consider the design of a binary adder. Binary adders are a key component of logic circuits. They are used not only in the arithmetic logic units (ALU) for data processing but are also used in other parts of a logic processor to calculate addresses and signal evaluations. An N-bit binary adder takes two binary numbers (A and B) of size N and a carry-in ( $C_{in}$ ) as inputs, sum up the three values, and produces a sum (S) and a carry-out ( $C_{out}$ ), as shown in Figure 1.



Figure 1: N-bit Binary Adder Block Diagram

Among the many different binary adder designs, the most straightforward one is the Carry-Ripple Adder (CRA). It is constructed using N full-adders. A full-adder is a single-bit version of the binary adder, where three binary bits  $(A, B \text{ and } C_{in})$  are inputted through a set of logic gates to produce a single-bit sum (S) and a single-bit carry-out  $(C_{out})$ , as shown in Figure 2. The N full-

adders are then linked together in series through the carry bits, forming an N-bit binary adder. When the binary inputs are provided, the full-adder of the least significant bit (LSB) will produce a sum ( $S_0$ ) and a carry-out ( $C_1$ ). The carry-out is fed to the carry-in of the second full-adder, which then produces a second sum ( $S_1$ ) and a second carry-out ( $C_2$ ). The process ripples through all N bits of the adder as shown in Figure 3, and settles when the full-adder of the most significant bit (MSB) outputs its sum ( $S_{N-1}$ ) and carry-out ( $C_{out}$ ).



Figure 2: Full-Adder Block Diagram



Figure 3: N-bit Carry-Ripple Adder Block Diagram

The CRA is simple in the design and straightforward to implement, but the long computation time is its drawback. Every full-adder has to wait for their lower-bit neighbor to produce a carry-out before it can correctly compute its sum and carry-out. This means that the propagation delay of the CRA increases with *N*. If one wishes to reduce the computation time, it is apparent that the computation of the carry-out bits has to be somehow parallelized. And this is precisely how a carry-lookahead adder operates.

Instead of waiting on the actual carry-in values, Carry-Lookahead Adder (CLA) uses the concept of *generating* (G) and *propagating* (P) logic. The concept is that every bit of the CLA makes predictions using its immediate available inputs (A and B), and predicts what its carry-out would be for any value of its carry-in. A carry-out is *generated* (G) if and only if both available inputs (A and B) are 1, regardless of the carry-in. The equation is  $G(A,B) = A \cdot B$ . On the other hand, a carry-out has the possibility of being *propagated* (P) if either A or B is 1, which is written as  $P(A,B) = A \oplus B$ . With P and G defined, the Boolean expression for the carry-out  $C_{i+1}$  giving a potential  $C_i$  is then  $C_{i+1} = G_i + (P_i \cdot C_i)$ . Notice that  $C_{i+1}$  can be expressed in terms of  $C_i$  which in turn can be expressed in terms of  $C_{i-1}$ . However, if  $C_{i+1}$  still depends on  $C_i$ , it will behave like a ripple adder without giving any gain in speed. Therefore, to avoid the slow rippling of the carry bits, the expression of  $C_{i+1}$  should be expanded and computed directly from  $C_i$  s. For example,

$$C_{0} = C_{in}$$

$$C_{1} = C_{in} \cdot P_{0} + G_{0}$$

$$C_{2} = C_{in} \cdot P_{0} \cdot P_{1} + G_{0} \cdot P_{1} + G_{1}$$

$$C_{3} = C_{in} \cdot P_{0} \cdot P_{1} \cdot P_{2} + G_{0} \cdot P_{1} \cdot P_{2} + G_{1} \cdot P_{2} + G_{2}$$

In this way, the computation time of the CLA is much faster than that of the CRA, resulting in a higher operating frequency. The downside of the CLA is its additional logic gates, which increases both the area and power consumption of the adder.



Figure 4: N-bit Carry-Lookahead Adder Block Diagram

To build an arbitrarily long N-bit CLA, one might be tempted to directly follow the above 'flat' approach. However, from the explicit expansion of  $C_i$ , you can find that the number of gates

involved for an increasing *N* will soon grow too large for the CLA to be practical. And thus, it is a common practice to first construct 4-bit CLAs, then use them to create a larger CLA in a hierarchical fashion. In this lab, the CLA should be implemented in 4x4-bit instead of 16-bit.

In the 4x4-bit hierarchical CLA design, the 16-bit inputs A and B are divided into groups of 4 bits. First, each group of 4 bits go through a 4-bit CLA, which is illustrated by Figure 4 with N=4. Note that the 4-bit CLA generates two additional output signals, the group propagate ( $P_G$ ) and the group generate ( $G_G$ ), with their logics being:

$$P_G = P_0 \cdot P_1 \cdot P_2 \cdot P_3$$

$$G_G = G_3 + G_2 \cdot P_3 + G_1 \cdot P_3 \cdot P_2 + G_0 \cdot P_3 \cdot P_2 \cdot P_1$$

We will denote the  $P_{GS}$  and  $G_{GS}$  from these four 4-bit CLAs as  $P_{G0}$ ,  $P_{G4}$ ,  $P_{G8}$ ,  $P_{G12}$ , and  $G_{G0}$ ,  $G_{G4}$ ,  $G_{G8}$ ,  $G_{G12}$  from this point on.

Next, a tempting design is to cascade the four 4-bit CLAs by connecting the  $C_{out}$  from the previous 4-bit CLA to the  $C_{in}$  of the next 4-bit CLA, but in this way we will be trapped by the slow rippling of these carry bits again. Therefore, instead of using the  $C_{out}$  from the previous 4-bit CLA, we should generate the  $C_{in}$ s of the 4-bit CLAs using the  $P_G$ s and  $G_G$ s, as shown by the formulas below.

$$\begin{split} C_4 &= G_{G0} + C_0 \cdot P_{G0} \\ C_8 &= G_{G4} + G_{G0} \cdot P_{G4} + C_0 \cdot P_{G0} \cdot P_{G4} \\ C_{12} &= G_{G8} + G_{G4} \cdot P_{G8} + G_{G0} \cdot P_{G8} \cdot P_{G4} + C_0 \cdot P_{G8} \cdot P_{G4} \cdot P_{G0} \\ \dots \end{split}$$

Does this look familiar to you? Observe that this is the same as how we generated the carry bits within a 4-bit CLA. Therefore, we can directly take a copy of the 4-bit Carry-Lookahead Unit (CLU, red block in Figure 4) in the 4-bit CLA, but instead of the inputs coming from full adders, this time the inputs are the  $P_G$ s and  $G_G$ s from the 4-bit CLAs at the upper level. Figure 5 illustrates the resulting 4x4-bit hierarchical CLA.

This explains why this design is called *hierarchical*. If we add another layer to the hierarchy and use four 4x4-bit hierarchical CLAs and another 4-bit CLU, we can make a 4x4x4-bit hierarchical CLA, namely a 64-bit adder, without any issue of the slow rippling of the carry bits!



Figure 5: A 4x4-bit Hierarchical Carry-Lookahead Adder Block Diagram

Carry-Select Adder (CSA) features another way to speed up the carry computation. It consists of two full adders (or CRAs if multiple bits are grouped) and a multiplexor. One adder computes the sum and carry-out based on the assumption that the carry-in is 0, and the other assumes that the carry-in is 1. In this way, both possible outcomes are pre-computed. Once the real carry-in arrives, the corresponding sum and carry-out is selected to be delivered to the next stage. By paying the price of almost twice the numbers of adders, we gain some speedup (how exactly do we gain this speedup – we will discuss this in lecture, but you should make sure you understand and explain in your own words for your lab report!)

In this lab, you are going to design a 16-bit CSA with 4x4-bit hierarchical structure as illustrated by Figure 5. For each group of 4-bit inputs, we use two CRAs to calculate two versions of the results, one with carry-in bit assumed to be 0 and the other to be 1. Note that the lowest significant group requires only one CRA, since its carry-in bit is directly available. Therefore, eventually the 16-bit CSA will contain seven 4-bit CRAs.



Figure 5: 16-bit Carry-Select Adder Block Diagram

## Your circuits should have the following inputs and outputs:

```
Inputs

Clk, Reset_Clear, Run_Accumulate – logic
SW – logic [9:0]

Outputs

CO – logic
LED – logic[9:0]

HEX0, HEX1, HEX2, HEX3, HEX4, HEX5 – logic [6:0]

Internal Registers

In – logic [16:0] holds either switch or the sum
Out – logic [16:0] holds output of one operation
S– logic [16:0] holds the adder sum result
extended_SW– logic [15:0] extends the 10 bits switch to 16 bits
```

SW[9:0] should come from on-board switches and its value should be displayed on AHex0, AHex1, and LED[1:0]. There are 10 switches on DE10 board. AHex0 and AHex1 cover the lowest 8-bits. The most two significant bits are displayed on LED[1:0]. There are only two buttons, Reset\_Clear and Run\_ Accumulate. At other times, the registers extended\_SW [15:0] constantly load the values of SW[9:0] which serve as *A*, the other number to be added, while the displayed value of the accumulator (B). The CO should be displayed on LED[9] to indicate overflow. When Run\_Accumulate is pressed, S[15:0] and CO should be updated with the result of adding extended\_SW[15:0] (*A*) and the updated Out[15:0] (*B*). The New sum result will be written into Out[15:0]. Each press on Run\_Accumulate accumulates the switch value to the previous sum result. Reset\_Clear should clear all the registers. To achieve optimal speed and resource usage balance, the CLA and CSA will also need to be built in a hierarchical fashion. In this lab, they should be implemented in 4x4-bit instead of 16-bit. Also, for this lab, Cin (carry-in) may be assumed to be 0.

A test platform is required to demo your adders as there are not enough switches on the DE10 board. This platform is provided in the included Lab 3 files on the website, and it should be clear where to place your code for the three types of adders you will design. Registers extended\_SW and Out store the operands to be added, depending on whether Run\_Accumulate is pressed (extended\_SW is continuously loaded from the switches on every cycle). Upon pressing the 'Run\_Accumulate' button, the state machine will load the resulting sum (A+B) into the 16-bit Out register to display. The load and run operation will be executed only once when the Run\_Accumulate button is pressed each time, respectively. The circuit should be able to run multiple times without resetting the circuit before each operation.

## III. PRE-LAB

- A. Design, document, and implement a 16-bit carry-ripple adder, a 16-bit carry-lookahead adder, and a 16-bit carry-select adder in SystemVerilog. Use the provided code (from the website) as a testing framework.
- B. Document design analysis for the three adders in the table below. Plot out the data from the table for comparison studies. Normalize the data across the three adders with the carry-ripple adder. When normalizing, choose data from one the carry-ripple adder as the baseline, and then divide the other two with the baseline number. Say, you got 20 from carry-ripple, 21 from carry-select, and 23 from carry-lookahead, the numbers after normalization becomes 20/20=1.0, 21/20=1.05, 23/20=1.15, respectively. The resulting plot should resemble the one below (the plot below does not use real data).

|               | Carry-Ripple | Carry-Select | Carry-Lookahead |
|---------------|--------------|--------------|-----------------|
| Memory (BRAM) |              |              |                 |
| Frequency     |              |              |                 |
| Total Power   |              |              |                 |



## You will need to bring the following to the lab:

Your code for the 3, 16-bit adders with a project ready to synthesize and test on the FPGA board, be prepared to show your TA each adder's code to verify they are indeed performing according to design.

## **Demo Points Breakdown:**

1.0 point: Correct operation of the Carry-Ripple Adder on the DE2 board

2.0 point: Correct operation of the Carry-Lookahead Adder on the DE2 board using a 4x4 **hierarchical design** (TA's will look at code)

2.0 point: Correct operation of the Carry-Select Adder on the DE2 board using a **4x4 hierarchical design** (TA's will look at code)

## IV. <u>LAB</u>

Follow the Lab 3 demo information on the course website. The pin assignment table is included below. Note that you may export/import the pin assignments for use in future labs.

**Pin Assignment Table** 

| Fin Assignment Table |          |                                        |  |  |
|----------------------|----------|----------------------------------------|--|--|
| Port Name            | Location | Comments                               |  |  |
| SW[0]                | PIN_C10  | On-board slider switch (SW0)           |  |  |
| SW[1]                | PIN_C11  | On-board slider switch (SW1)           |  |  |
| SW[2]                | PIN_D12  | On-board slider switch (SW2)           |  |  |
| SW[3]                | PIN_C12  | On-board slider switch (SW3)           |  |  |
| SW[4]                | PIN_A12  | On-board slider switch (SW4)           |  |  |
| SW[5]                | PIN_B12  | On-board slider switch (SW5)           |  |  |
| SW[6]                | PIN_A13  | On-board slider switch (SW6)           |  |  |
| SW[7]                | PIN_A14  | On-board slider switch (SW7)           |  |  |
| SW[8]                | PIN_B14  | On-board slider switch (SW8)           |  |  |
| SW[9]                | PIN_F15  | On-board slider switch (SW9)           |  |  |
| LED[0]               | PIN_A8   | On-Board LED (LED0)                    |  |  |
| LED[1]               | PIN_A9   | On-Board LED (LED1)                    |  |  |
| LED[2]               | PIN_A10  | On-Board LED (LED2)                    |  |  |
| LED[3]               | PIN_B10  | On-Board LED (LED3)                    |  |  |
| LED[4]               | PIN_D13  | On-Board LED (LED4)                    |  |  |
| LED[5]               | PIN_C13  | On-Board LED (LED5)                    |  |  |
| LED[6]               | PIN_E14  | On-Board LED (LED6)                    |  |  |
| LED[7]               | PIN_D14  | On-Board LED (LED7)                    |  |  |
| LED[8]               | PIN_A11  | On-Board LED (LED8)                    |  |  |
| LED[9]               | PIN_B11  | On-Board LED (LED9)                    |  |  |
| HEX0[0]              | PIN_C14  | On-Board seven-segment display segment |  |  |
| HEX0[1]              | PIN_E15  | On-Board seven-segment display segment |  |  |
| HEX0[2]              | PIN_C15  | On-Board seven-segment display segment |  |  |
| HEX0[3]              | PIN_C16  | On-Board seven-segment display segment |  |  |
| HEX0[4]              | PIN_E16  | On-Board seven-segment display segment |  |  |
| HEX0[5]              | PIN_D17  | On-Board seven-segment display segment |  |  |
| HEX0[6]              | PIN_C17  | On-Board seven-segment display segment |  |  |
| HEX1[0]              | PIN_C18  | On-Board seven-segment display segment |  |  |
| HEX1[1]              | PIN_D18  | On-Board seven-segment display segment |  |  |
| HEX1[2]              | PIN_E18  | On-Board seven-segment display segment |  |  |

| HEX1[3]        | PIN_B16 | On-Board seven-segment display segment     |
|----------------|---------|--------------------------------------------|
| HEX1[4]        | PIN_A17 | On-Board seven-segment display segment     |
| HEX1[5]        | PIN_A18 | On-Board seven-segment display segment     |
| HEX1[6]        | PIN_B17 | On-Board seven-segment display segment     |
| HEX2[0]        | PIN_B20 | On-Board seven-segment display segment     |
| HEX2[1]        | PIN_A20 | On-Board seven-segment display segment     |
| HEX2[2]        | PIN_B19 | On-Board seven-segment display segment     |
| HEX2[3]        | PIN_A21 | On-Board seven-segment display segment     |
| HEX2[4]        | PIN_B21 | On-Board seven-segment display segment     |
| HEX2[5]        | PIN_C22 | On-Board seven-segment display segment     |
| HEX2[6]        | PIN_B22 | On-Board seven-segment display segment     |
| HEX3[0]        | PIN_F21 | On-Board seven-segment display segment     |
| HEX3[1]        | PIN_E22 | On-Board seven-segment display segment     |
| HEX3[2]        | PIN_E21 | On-Board seven-segment display segment     |
| HEX3[3]        | PIN_C19 | On-Board seven-segment display segment     |
| HEX3[4]        | PIN_C20 | On-Board seven-segment display segment     |
| HEX3[5]        | PIN_D19 | On-Board seven-segment display segment     |
| HEX3[6]        | PIN_E17 | On-Board seven-segment display segment     |
| HEX4[0]        | PIN_F18 | On-Board seven-segment display segment     |
| HEX4[1]        | PIN_E20 | On-Board seven-segment display segment     |
| HEX4[2]        | PIN_E19 | On-Board seven-segment display segment     |
| HEX4[3]        | PIN_J18 | On-Board seven-segment display segment     |
| HEX4[4]        | PIN_H19 | On-Board seven-segment display segment     |
| HEX4[5]        | PIN_F19 | On-Board seven-segment display segment     |
| HEX4[6]        | PIN_F20 | On-Board seven-segment display segment     |
| HEX5[0]        | PIN_J20 | On-Board seven-segment display segment     |
| HEX5[1]        | PIN_K20 | On-Board seven-segment display segment     |
| HEX5[2]        | PIN_L18 | On-Board seven-segment display segment     |
| HEX5[3]        | PIN_N18 | On-Board seven-segment display segment     |
| HEX5[4]        | PIN_M20 | On-Board seven-segment display segment     |
| HEX5[5]        | PIN_N19 | On-Board seven-segment display segment     |
| HEX5[6]        | PIN_N20 | On-Board seven-segment display segment     |
| Clk            | PIN_P11 | 50 MHz Clock from the on-board oscillators |
| Reset_Clear    | PIN_B8  | On-Board Push Button (KEY0)                |
| Run_Accumulate | PIN_A7  | On-Board Push Button (KEY1)                |
|                |         |                                            |

# V. <u>POST-LAB</u>

- 1) In the CSA for this lab, we asked you to create a 4x4 hierarchy. Is this ideal? If not, how would you go about designing the ideal hierarchy on the FPGA (what information would you need, what experiments would you do to figure out?)
- 2) For the adders, refer to the **Design Resources and Statistics** in IQT.16-18 and complete the following design statistics table for each adder. This is more comprehensive than the above design analysis and is required for every SystemVerilog circuit.

| LUT           |  |
|---------------|--|
| DSP           |  |
| Memory (BRAM) |  |
| Flip-Flop     |  |
| Frequency     |  |
| Static Power  |  |
| Dynamic Power |  |
| Total Power   |  |

Observe the data plot and provide explanation to the data, i.e., does each resource breakdown comparison from the plot makes sense? Are they complying with the theoretical design expectations, e.g., the maximum operating frequency of the carry-lookahead adder is higher than the carry-ripple adder? Which design consumes more power than the other as you expected, why?

## VI. REPORT

Write a report, you may follow the provided outline below, or make sure your own report outline includes at least the items enumerated below.

## 1. Introduction

a. Summarize the high-level function performed by the serial logic processor and the three adders.

#### 2. Adders

- a. Ripple Carry Adder
  - i. Written description of the architecture of the adder
  - ii. Block diagram.
- b. Carry Lookahead Adder
  - i. Written description of the architecture of the adder
  - ii. Describe how the P and G logic are used.
  - iii. Describe how you created the hierarchical 4x4 adder.
  - iv. Block diagram
    - 1. Block diagram inside a single CLA (4-bits)
    - 2. Block diagram of how each CLA was chained together.

## c. Carry Select Adder

- i. Written description of the architecture of the adder
- ii. Describe at a high level how the CSA speculatively computes multiple sums in parallel and rapidly chooses the correct one later. Make sure you understand this!
- iii. Block Diagram of the whole CSA circuit containing adders, multiplexers, and glue logic.
- d. Written description of all .SV modules, see the Appendix I of the previous lab for the specific formatting.
- e. Describe at a high level the area, complexity, and performance tradeoffs between the adders.
- f. Document the performance of each adder by creating a graph as specified in Prelab part C (page 4.6 in the manual).
- g. Optional for extra credits, perform a critical path analysis and compare this to the theoretical understanding of each adder.
- 3. Answers to the post-lab questions. As usual, they may be in their own section or dispersed into the appropriate sections in the rest of the report.

#### 4. Conclusion

- a. Describe any bugs and countermeasures taken during this lab.
- b. Was there anything ambiguous, incorrect, or unnecessarily difficult in the lab manual or given materials which can be improved for next semester? You can also specify what we did right, so it does not get changed.
- c. Any additional summary you want to include.